Abstract

In this work we describe a new approach using relative 
debugging to find differences in computation between a se- 
rial program and a parallel version of that program. We use 
a combination of re-execution and backtracking in order to 
find the first difference in computation that may ultimately 
lead to an incorrect value that the user has indicated. In 
our prototype implementation we use static analysis infor- 
mation from a parallelization tool in order to perform the 
backtracking as well as the mapping required between se- 
rial and parallel computations. 


1. Introduction 

As the number of parallel computers increases, so does 
the demand for converting existing programs into parallel 
form. There are three general approaches to the conversion 
process: 

• manual translation, where explicit parallel and com- 
munication constructs are added to the code; 

• compiler-based parallelization, where the user employs
source code directives [10, 14] to steer the compiler
into producing efficient parallel code; and

• fully interactive parallelization tools, where user inputs
steer the parallelization process [3, 11].

Regardless of the approach used, the porting process is 
error-prone. Even in the more automatic alternatives the 
user is providing information, and it is mistakes in that
information that lead to bugs in the parallel program. For
example, the user might incorrectly indicate that a loop can
be safely run in parallel by deleting an essential edge in
the dependence graph. The resulting program may have
race conditions in a shared-memory version, or use values 
of variables that are not up-to-date in a distributed-memory 
version. 

Finding bugs introduced during parallelization can be 
very complicated. Often the user must manually run the 
serial and parallel programs side-by-side to try to find out 
where the two computations differ. This technique likely 
involves numerous executions of the programs in an attempt 
to locate the first difference that potentially led to all subse- 
quent differences. 

In this paper we describe our approach for automating 
the manual debugging technique described above, as applied
to distributed-memory programs. We begin by discussing
our general approach, including the information required
and the difference detection algorithm. In Section
3 we describe our prototype implementation, and following 
that we give an example of its use. In Section 5 we discuss 
ways to extend our work. We then discuss related work and 
draw conclusions. 

2. Finding the First Difference 

The manual technique of running the serial and paral- 
lel programs side-by-side involves deciding where to put 
breakpoints and, at breakpoints, deciding what values to 
compare. For example, if the parallel program is printing 
a wrong value, the user can look at the source code and fol- 
low enough statements backward from the print statement to 
see where the incorrect value was calculated. He could then
insert a breakpoint at that definition point and re-execute the 
program to see what values used in the calculation are in- 
correct. He can repeat this process of following statements 
backward until the source of the error is reached. 

Automating this search for the first difference between 
the computations of a serial program and its parallelized
version requires two important elements:

1. a method to drive the search that decides which value 
comparisons are useful, and 

2. the ability to make comparisons between the serial and
parallel computations for the values we choose to compare.

The first element can be provided by combining the user’s
observations of program behavior with data dependence
analysis, which describes how values get created and used
within a program. The possible definition points of a known
bad value can be found, and further comparisons can be
performed on the values involved in those definition points.

The second element requires determining how the serial 
computation has turned into a parallel one. This computation
mapping answers the important comparison questions:

• Where to compare: the program statements where
the serial and parallel processes should be instrumented
to perform comparisons;

• In whom to compare: the processes in the parallel
execution containing the values to be compared;

• When to compare: the iteration count at which the
comparison should be made in each process; and

• How to compare: the comparison function that
describes how to construct the values to be compared
(e.g., obtain the checksum of a distributed array), and
then describes how to determine “equality” of the values.

In the remainder of this section we discuss the computation
mapping information we need, then describe our algorithm
and how it makes use of this information.

2.1. Computation Mapping 

The computation mapping from a serial program to a
parallelized version of that program can be broken down 
into several components. Here we discuss each component, 
and the manner in which they facilitate answering the com- 
parison questions listed earlier. 

• Source-to-source mapping can be used to answer the
“where to compare” question. It is a description of lo- 
cation correspondence in the serial and parallel source 
code where we can expect values to be comparable. 
For example, we may want to place instrumentation 
breakpoints at one line in the serial code and at a
corresponding line in the parallel code. This mapping
information likely goes beyond simple line correspondence,
however, since the parallel source code may differ
significantly from the serial source code.


• Execution mapping answers the “in whom to compare” 
question. This information builds on the source map- 
ping given above, and describes which parallel pro- 
cesses actually execute the program statements in the 
parallel source code that map to a certain location in 
the serial source code. 

For example, if the serial program performs an initial 
phase of file I/O, a typical execution mapping will in- 
form us that the corresponding file I/O code in the par- 
allel program is confined to the first process. Often this 
mapping can only be determined at runtime, when the 
number of parallel processes is known and the compu- 
tation has been split up among those processes. 

• Iteration mapping answers the “when to compare”
question. If a value we wish to compare is located in a
program statement that is executed more than once, we 
rely on iteration mapping information to tell us which 
of those executions should be instrumented to compare 
the value. Multiple executions of the same statement 
may occur because of loop constructs, or perhaps be- 
cause of multiple subroutine executions (e.g., recur- 
sion). 

• Data value mapping is used to answer the “how to 
compare” question. This mapping builds on the execution
mapping by providing a description of how serial-side
values may be represented in the parallel computation,
and also provides an equality function with which
to compare values. For example, if the computations
perform a sum reduction on a vector, the serial program
may have a variable S whose value is actually
the sum of the S values found in the parallel processes.
Furthermore, when comparing the value of variable S 
in the serial process to the sum of S values in the par- 
allel processes, we might consider them equal if they 
agree within some tolerance that accounts for differ- 
ences in numerical method. 

A notable subset of this mapping is data distribution 
information. The description of how variables, ar- 
rays in particular, are distributed across multiple ad- 
dress spaces is important for debugging parallelized 
programs. For example, it must indicate how a serial-side
array index expression gets mapped into a process
number and a list of indices in the parallel computation.
If data decomposition is employed in the parallel
computation, the data distribution description must
contain information about “ghost points” and other issues
relating to data on the boundaries of the decomposition.

We next describe how these pieces of information are 
utilized to produce an effective approach to relative
debugging.



2.2. Algorithm 

In this work we have automated the manual technique for
comparison debugging that was described at the beginning
of this section. When a user indicates a bad value in the
execution of the parallel program, we perform the following
steps.

(1) Find the possible definition points of the incorrect
    value using dependence analysis information.

(2) Examine the variable references on the right-hand
    sides of those definitions to determine a set of suspect
    variable references to monitor in a re-execution.

(3) While there are new suspect variable references to
    monitor:

    (3a) Instrument the suspect variable references in
         both the serial and parallel versions of the
         program.

    (3b) Execute the instrumented programs, stopping
         when a difference (i.e., a bad value) is detected.

    (3c) If the bad value has not been seen before, then
         use it to determine a new set of suspect variable
         references to instrument (as was done in
         steps (1) and (2) above); otherwise allow the loop
         to terminate.
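
The steps above can be sketched in Python on a toy straight-line
program. This is a minimal sketch under stated assumptions: the
dictionaries REACHING_DEFS and RHS_REFS stand in for dependence
analysis results, run_and_compare stands in for an instrumented
re-execution of both programs, and instrumentation is assumed to
accumulate across iterations. All names and data are hypothetical.

```python
# Toy program:  S1: a = in1 + in2   S2: b = a + c
#               S3: out = b + d     S4: print out
ORDER = ["S1", "S2", "S3", "S4"]
# (variable, use location) -> locations of definitions that may reach it
REACHING_DEFS = {("out", "S4"): ["S3"], ("b", "S3"): ["S2"], ("a", "S2"): ["S1"]}
# definition location -> variable references on its right-hand side
RHS_REFS = {
    "S1": [("in1", "S1"), ("in2", "S1")],
    "S2": [("a", "S2"), ("c", "S2")],
    "S3": [("b", "S3"), ("d", "S3")],
}
# References whose parallel value differs from the serial value.
BAD = {("in2", "S1"), ("a", "S2"), ("b", "S3"), ("out", "S4")}

def suspects_for(ref):
    """Steps (1) and (2): references used by definitions reaching ref."""
    found = set()
    for def_loc in REACHING_DEFS.get(ref, []):
        found.update(RHS_REFS.get(def_loc, []))
    return found

def run_and_compare(instrumented):
    """Stand-in for (3a)/(3b): earliest instrumented reference that differs."""
    hits = [r for r in instrumented if r in BAD]
    return min(hits, key=lambda r: ORDER.index(r[1]), default=None)

def find_first_difference(bad_ref):
    first_diff, seen_bad, instrumented = bad_ref, {bad_ref}, set()
    new = suspects_for(bad_ref)
    while new - instrumented:              # step (3)
        instrumented |= new
        diff = run_and_compare(instrumented)
        if diff is None or diff in seen_bad:
            break                          # step (3c): nothing new was found
        seen_bad.add(diff)
        first_diff = diff
        new = suspects_for(diff)
    return first_diff

print(find_first_difference(("out", "S4")))
```

Each pass through the loop costs one re-execution of both programs,
so in this toy case three runs are needed to walk back from the print
statement to the bad input.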

When the loop stops we have identified the first difference
between the two computations that may lead to the bad
value identified by the user. While this strategy works, we
can improve upon it by using a limited form of backtracking
[2]. In particular, if we can determine the value that a new
suspect variable would have at a potential instrumentation
point, we can avoid the re-execution that takes place in step
(3).

Although backtracking improves our efficiency substantially,
its typical implementation (checkpointing program
variables before they get overwritten during execution) is
too costly. Instead, we are satisfied to limit the scope of our
queries about previous program states to those that can be
answered using only information in the current state. If a
variable value created at one point in the program has not
been killed by subsequent execution, we can evaluate it and
determine either that it is OK or bad. If it is OK, then we
can ignore it and concentrate on other values. If it is bad,
we can look for values used in its definitions (as is done in
steps (1) and (2)). If a variable value has been killed we
can simply add the variable to the list of suspect references
to be monitored during a re-execution.

For example, suppose we have determined that the value
of w3 is bad in line L5 in Figure 1. Suppose as well that
the definitions of w3 that may reach that point are at L2 and
L3. In that case, we will attempt to evaluate the variables
r2, w2, r3, and u1 (see footnote 1). If we cannot evaluate
one, say u1, because its value is killed between lines L3 and
L5, then we will add the use of u1 at line L3 to the list of
suspect references that need to be instrumented in the next
re-execution. If we evaluate a variable, for example r2 or r3,
and find that the value matches the serial value, then we can
ignore it. If the value of one, say w2 in line L2, is not the
same, then we look at the definitions of w2 that reach the use
in L2. In this case suppose the only definition is at line L1.
We would then look at the variables r1 and w1 to see if they
need to be added to the list of suspect references. Suppose
r1 is correct and that w1 is incorrect. Then the algorithm
will proceed by evaluating the definition points of w1 that
reach line L1.

    L1:   w2 = r1 + w1
          if (...) then
    L2:      w3 = r2 + w2
          else
    L3:      w3 = r3 + u1
          endif
    L4:   u1 = 0
    L5:   x = r4 + w3

Figure 1. Simple backtracking.
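
The backtracking walk over Figure 1 can be sketched as follows. This
is an illustrative Python sketch, not the prototype's code: the
dictionaries encode the reaching definitions and right-hand-side
variables of Figure 1 by hand, the set killed lists uses whose value
has been overwritten before the breakpoint at L5, and bad_now lists
the variables whose current parallel value differs from the serial
one. All of these encodings are assumptions made for the example.

```python
def backtrack(bad_ref, reaching_defs, rhs_vars, killed, bad_now):
    """Walk definitions backward from bad_ref using only the current state.

    Returns (bad_found, to_instrument): uses confirmed bad without a
    re-execution, and uses that must be instrumented in the next run.
    """
    bad_found, to_instrument = [], []
    work, visited = [bad_ref], set()
    while work:
        ref = work.pop()
        if ref in visited:
            continue
        visited.add(ref)
        for def_loc in reaching_defs.get(ref, []):
            for var in rhs_vars[def_loc]:
                use = (var, def_loc)
                if use in killed:
                    to_instrument.append(use)   # value is gone: needs a re-run
                elif var in bad_now:
                    bad_found.append(use)       # still live and wrong: keep going
                    work.append(use)
                # otherwise the value is live and correct, so it is ignored
    return bad_found, to_instrument

# Hand-written encoding of Figure 1, stopped at L5 with w3 known bad.
reaching_defs = {("w3", "L5"): ["L2", "L3"], ("w2", "L2"): ["L1"]}
rhs_vars = {"L1": ["r1", "w1"], "L2": ["r2", "w2"], "L3": ["r3", "u1"]}
killed = {("u1", "L3")}          # u1 is overwritten at L4
bad_now = {"w1", "w2", "w3"}

bad_found, to_instrument = backtrack(
    ("w3", "L5"), reaching_defs, rhs_vars, killed, bad_now)
print(bad_found)
print(to_instrument)
```

The walk reaches w1 at L1 without any re-execution, and only the
killed use of u1 at L3 is left for instrumentation.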

We can potentially further reduce the number of
re-executions by speculatively examining the definition points
of unknown values. In the example given above, the value
of u1 in line L3 is killed by the assignment at line L4. Since
we couldn’t directly determine the value used at line L3, we
simply instrumented the use of u1 at that point in order to
do a comparison in the next re-execution. However, we can
look at the definitions of u1 that reach line L3 to see if any
of them use bad values. If we find a bad value, we can
continue the search for differences at that point, potentially
saving us a re-execution of the program.

Note that the accuracy of the dependence information 
plays an important role in the efficiency of the algorithm. In 
particular, having accurate interprocedural information will 
allow us to examine backwards through procedure calls. 
Consider the example in Figure 2. Suppose we find out
that the value of w7 is incorrect at line L8. Accurate
interprocedural information could tell us that the statement at
line L9 might define the value of w7 that reaches L8. (Note
that the test for a definition kill of the variables used at L9
must check for kills from L9 to L10 as well as check from
L7 to L8.) If no kills are found, we can evaluate b and r7.

1 In this and the following example, we use a mnemonic for the reader’s 
convenience: the names of variables with wrong values begin with a w, 
right values with an “r,” and untestable values with a u. 


    L6:   w6 = u6 + u7
    L7:   call sub(w7, w6)
    L8:   x = w7 + r6
          end

          subroutine sub(a, b)
    L9:   a = b + r7
    L10:  end

Figure 2. Interprocedural backtracking.

Suppose we find that r7 is correct, but b is not. The
interprocedural information could tell us that the statement at
L6 defines a value for b that might reach L9. We can then
continue the backtracking by looking at the variables on the
right-hand side of L6. In the absence of accurate
interprocedural information we would have had to treat the call at
L7 as a potential definition, and use, of all actual parameters
as well as any global variables (i.e., those in common
blocks in Fortran). In that case we might have had to
instrument every statement in sub in order to find an erroneous
definition of w7.

Although our examples so far have used only scalar variables,
our approach also works when subscripted array
expressions are present. Having the results of a sophisticated
data dependence analysis phase is critical to being able to
follow the flow of values in arrays from their definition to
their use.

In summary, the algorithm we have proposed utilizes
information from static analysis in three ways:

1. to find the possible definition points of a value V being
used at some location L,

2. to enumerate the values used in a definition at some
location L, and

3. to determine if a value V which is used in location L1
is still available in the program state at location L2.

For the correctness of our algorithm, the answers to these
questions must be conservative. In particular, we must get
all of the possible definition points in (1) and all of the values
used in a definition in (2). For (3), we must only get
a “yes” answer if the value definitely survives. The more
accurate this static information is, the more efficiently our
difference finding algorithm will perform. Overly conservative
responses could result in extra instrumentation points or
re-executions.


3. Prototype Implementation

We prototyped the algorithm of the previous section by 
extending two existing tools to cooperate with each other. 

• Computer Aided Parallelization Tools (CAPTools) is
a parallelization tool from the University of Greenwich
[5, 11]. The user assists the tool in converting
serial programs into a form suitable for execution on 
a distributed-memory machine. It performs sophisti- 
cated dependence analysis, partitions array data, and 
inserts needed communication calls. 

• P2d2 is a debugger for parallel programs from NASA 
Ames [6]. It is portable across a variety of parallel ma- 
chines and its user interface scales so that it is capable 
of debugging 256 processes or more. 

In this section we first describe how these existing tools 
were extended to produce the prototype. We then discuss 
some shortcuts we took in the implementation in the inter- 
est of speedy development. 

3.1. Putting the Pieces Together 

In the prototype, CAPTools is used to create the paral- 
lel program whose correctness is dependent upon correct 
user interactions. The results of dependence analysis, data 
partitioning, and parallel communication insertion are
deposited into a database. A library encapsulation of
CAPTools, with modifications to accommodate the information
needed in this prototype, is used as the comparison
algorithm proceeds.

P2d2 acts as the user interface. The user is queried for 
an initial variable name and a location where that variable 
is used but appears to have the wrong value. P2d2 runs the 
serial and parallel programs to verify this difference exists, 
then 

• retrieves a list of instrumentation points from
CAPTools,

• inserts breakpoints in the serial and parallel programs 
at those points, and 

• reruns the programs, checking the appropriate values 
at each breakpoint. 

These steps are repeated as long as new, earlier, differ- 
ences of interest are found in the values being checked at 
breakpoints. If a re-execution results in the same first dif- 
ference detected, then the algorithm terminates. 



To perform the first step above, p2d2 holds a conversation
with CAPTools. The debugger initiates by requesting
from CAPTools a list of instrumentation points that might
lead to the bad value just found, where the bad value is
described with the pair

(variable name, location in source code)

for the serial version of the program. In response, CAPTools
examines its database of dependence information, using
the algorithm described in the previous section, looking
for the possible definition points of the bad value. It handles
this even if the value flows across procedure boundaries and
gets remapped in going from caller to callee. For each of the
definition points it uses a range of dependence information
throughout the relevant call stacks to determine if the values
used in the definition are still live, and then makes requests
of p2d2 for different evaluations to determine the correctness
of values in the parallel processes.

Evaluations in the serial program are straightforward. In
the parallel program, however, there are varying levels of
complexity to overcome in determining the data value mapping
needed for performing evaluations. In the simple case,
a value may be duplicated in all parallel processes, or perhaps
in a statically defined subset of those processes. CAPTools
need only indicate to p2d2 the processes of interest,
and the stack frame in each process where the value can be
found. A more complex case involves a dynamically defined
mapping from a serial value to a parallel value. CAPTools
must then provide p2d2 with a description of the mapping
in order to facilitate such evaluations. For example, a
request for an evaluation of an array that is partitioned in
the parallel program would include a description of the
partition boundaries that hold for each parallel process.

When CAPTools has finished its examination of the static
and dynamic information, having exhausted all backtracking
possibilities from live variables, it returns to p2d2 a list
of 3-tuples

(variable, location in source code, process)

that need to be instrumented in a rerun of the serial and
parallel programs. Each entry in this list takes into account the
computation mapping from serial to parallel, using previously
evaluated dynamically defined mapping attributes if
necessary. The conversation between CAPTools and p2d2
for the first step of this iteration of the algorithm ends at
this point. The remaining steps of inserting breakpoints at
the instrumentation locations and rerunning the programs
are then performed by p2d2.

When there are no new interesting differences found between
the computations, p2d2 returns to the user the most
recent difference discovered by the algorithm, which is the
first that occurs between the computations that may have
caused the bad value reported by the user.


3.2. Limitations of the Prototype 

In the interest of speedy implementation, we limited the 
need for the computation mapping information described in 
Section 2.1. 

• We temporarily eliminated the need for a source-to- 
source mapping by restricting comparisons to compu- 
tations of programs that have the same source code. 
Essentially we take a parallelized program that has cor- 
rect behavior when run with one process but incorrect 
behavior when run with N processes, and compare the 
two computations. 

• The prototype does not yet perform sophisticated iter- 
ation mapping. The test executions that we are mak- 
ing are constructed so that the mapping is either not 
needed, or the effect of not having the mapping is min- 
imal (i.e., we know what the result would be if we did 
have iteration mapping). 

• The prototype does only a partial job of data value 
mapping. CAPTools takes care of mapping variable 
references from the sequential execution to their loca- 
tion in the distributed address space. We do not yet, 
however, perform any comparisons other than tests on 
simple expressions. For example, we do not aggregate 
values from the parallel execution into a single value 
for comparison against a value in the sequential run. 
In addition, we currently perform only strict equality 
tests of the values being compared. Furthermore, the 
tests are done on the ASCII strings the debugger uses 
to print the values, not the bit representations of the 
values. 

In Section 5 we describe our plans for removing these re- 
strictions. 
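
The last restriction can be illustrated with a short Python sketch
that contrasts three notions of equality: equality of printed
strings, equality of bit patterns, and equality within a tolerance.
The print format "%.6g" and the tolerance are assumptions chosen for
the illustration; the format actually used by the debugger is not
specified here.

```python
import math
import struct

def ascii_equal(a, b, fmt="%.6g"):
    """Equality on the printed form (fmt is a hypothetical print format)."""
    return (fmt % a) == (fmt % b)

def bits_equal(a, b):
    """Equality on the IEEE 754 double-precision bit patterns."""
    return struct.pack("<d", a) == struct.pack("<d", b)

def tolerant_equal(a, b, rel_tol=1e-12):
    """Equality within a (hypothetical) relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)

serial_value = 0.3
parallel_value = 0.1 + 0.2   # same quantity, different rounding

print(ascii_equal(serial_value, parallel_value))
print(bits_equal(serial_value, parallel_value))
print(tolerant_equal(serial_value, parallel_value))
```

A string test at limited precision hides small differences that a
bitwise test would flag, so the two strict tests can disagree about
whether a value is "bad".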

4. Example 

In order to illustrate the power of automated relative de- 
bugging, consider the following scenario. The user has par- 
allelized a serial version of the NAS Parallel Benchmark 
program LU [12] using CAPTools, implementing a 1-D
decomposition in the “J” dimension. In doing so, he
inadvertently introduces errors in the parallel program by
instructing the parallelization tool to ignore some dependences.
When the resulting parallel version is executed using just 
one process, program output exactly matches that of the se- 
rial version and every verification test in the benchmark suc- 
ceeds. Executing the parallel version with two processes, 
however, results in different program output and failure of 
every verification test. 


Pipelined (unmodified dependences):

      subroutine ssor()
      do k=nz-1,2,-1
         call buts(v,k)
      enddo
      return
      end

      subroutine buts(v,k)
      call cap_receive(v(1,1,high+1,k),...,cap_right)
      do j=high,low,-1
         do i=nx-1,2,-1
            do m=1,5,1
               tv(m,i,j) = f(v,tv)
               tmat(m,1) = d(m,1,i,j)
            enddo
            tv(1,i,j) = tv(1,i,j)/tmat(1,1)
            v(1,i,j,k) = v(1,i,j,k)-tv(1,i,j)
         enddo
      enddo
      call cap_send(v(1,1,low,k),...,cap_left)
      return
      end

Fully parallel (removed dependences):

      subroutine ssor()
      do k=nz-1,2,-1
         call cap_exchange(v(1,1,high+1,k),
                           v(1,1,low,k),
                           ...,cap_right)
         call buts(v,k)
      enddo
      return
      end

      subroutine buts(v,k)
      do j=high,low,-1
         do i=nx-1,2,-1
            do m=1,5,1
               tv(m,i,j) = f(v,tv)
               tmat(m,1) = d(m,1,i,j)
            enddo
            tv(1,i,j) = tv(1,i,j)/tmat(1,1)
            v(1,i,j,k) = v(1,i,j,k)-tv(1,i,j)
         enddo
      enddo
      return
      end

Figure 3. Effects of dependence modifications on parallelization.


4.1. User Interaction with the Prototype 

The user first looks at the verification subroutine to find
what values might have caused failures in the two-process
execution. One of the verification tests is a check of whether
a scalar variable in the program has the correct value, and so
this variable and source code location are entered into p2d2
to begin the relative debugging algorithm.

After our prototype has performed five iterations of
inserting instrumentation and re-running the one-process and
two-process executions, it reports that the first difference
between the computations is in the array tv at line 272 of
subroutine buts. (Note that the first difference the prototype
should have reported is in the array v at this location.
See the next subsection for our discussion of this.)

With this information in hand, the user opens CAPTools,
loads the database of information generated when the
parallel version of the benchmark was created, and begins to
scrutinize the data dependence choices that were made for
subroutine buts. CAPTools reports that the original analy- 
sis of the serial source code, before any user modification of 
dependences, indicated the subroutine might be amenable 
to partial parallelization using a pipeline, but not to a full 
parallelization. During the parallelization process, however, 
the user erroneously removed dependences for several vari- 
ables (including tv) within the subroutine. This allowed 
a full parallelization without the need for communication 
within the subroutine body to update values. See Figure 3 
for a code comparison. 

These modifications are plausible candidates for the 
cause of the incorrect behavior. A parallelized version of 
the subroutine would behave similarly to the serial version 
when executed with one process, because the degenerate 
case of one parallel process does not encounter the difficulty 
of stale values. If a dependence modification was indeed
made incorrectly, then stale values could become a problem
when two or more processes were used in the parallel
execution. Parallelizing the serial source code without these
user-introduced errors generates a parallel program that behaves
correctly when executed with any number of processes.

Figure 4. Portion of call graph of LU.

4.2. Behind the Scenes 

Our tests with the LU benchmark were conducted with
the “sample” class size (i.e., class S). It uses a 12 x 12 x 12
mesh and is executed on 2 processes. Thus, for a
three-dimensional array A that represents the entire mesh,
process 0 owns A(1:12,1:6,1:12) and process 1 owns
A(1:12,7:12,1:12).
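
The ownership stated above follows from an even block split of the
second dimension. A minimal Python sketch, assuming the mesh size
divides evenly among the processes and ignoring ghost points (the
function names are hypothetical, and the actual partitioning logic
of the parallelization tool may differ):

```python
def owned_range(proc, n, nprocs):
    """Return the inclusive 1-based (low, high) range owned by proc."""
    if n % nprocs != 0:
        raise ValueError("this sketch assumes an even block split")
    block = n // nprocs
    return proc * block + 1, (proc + 1) * block

def owner_of(j, n, nprocs):
    """Return the process that owns global index j (1-based)."""
    if not 1 <= j <= n:
        raise ValueError("index outside the mesh")
    low0, high0 = owned_range(0, n, nprocs)
    return (j - 1) // (high0 - low0 + 1)

for proc in range(2):
    print(proc, owned_range(proc, 12, 2))
print(owner_of(6, 12, 2), owner_of(7, 12, 2))
```

With this mapping a serial-side reference such as A(i,7,k) is
evaluated on process 1, while A(i,6,k) is evaluated on process 0.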

The automatic debugging algorithm starts by examining
the definition points of the user-indicated incorrect value of
variable xci in routine verify. (See also the relevant
part of the call graph shown in Figure 4.) The variable is
not defined there, but rather is an argument passed in, so the
search moves to the calling routine. This in turn leads to the
routine pintgr in which the variable (now named frc) is
defined. The definition statement is:

    frc=0.25d+00*(frc1+frc2-frc3)

where frc1, frc2, and frc3 are not overwritten before
the current execution state (i.e., at a breakpoint in verify).
However, they are variables local to pintgr and since it
has already exited, they cannot be evaluated. Therefore, an
instrumentation point is set in routine pintgr at the point
where frc is set and a re-execution is performed.

On reaching the definition of frc, the value of frc3
in the parallel execution proves to be incorrect when p2d2
compares it to the serial value. The search therefore continues
for its defining statements. One is found in the statement:

    frc3=deta*dzeta*frc3

where the values of deta and dzeta prove correct. The
incorrect value must therefore be frc3. However, the old
value was overwritten by this very assignment and thus cannot
be tested. Rather than issuing another instrumentation
for this variable (especially since this is the statement
immediately prior to the previous instrumentation point), the
definitions of the frc3 used here are inspected. This brings
us to a communication call that performs a global summation
of this variable, so the definitions of frc3 used in this
summation are then inspected. The definition encountered
is:

    frc3=frc3+(phi1(j,k)+phi1(j+1,k)+
               phi1(j,k+1)+phi1(j+1,k+1)+
               phi2(j,k)+phi2(j+1,k)+
               phi2(j,k+1)+phi2(j+1,k+1))

and the arrays phi1 and phi2 are inspected to determine
if they are correct. When more than one location in an array
is wrong, we find the “most incorrect” element in the array
on any process. In this case, both phi1 and phi2 prove
incorrect with the “worst” values identified being:

    phi1(6,7) (on process 1)
    phi2(6,6) (on process 0)
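
One plausible reading of “most incorrect” is the element with the
largest absolute deviation from the serial value; the text does not
define the measure, so this choice is an assumption. The Python
sketch below applies it to made-up values for a small distributed
array (the numbers are illustrative, not benchmark data).

```python
def most_incorrect(serial, parallel_by_proc):
    """Return (error, index, process) for the worst element, or None.

    serial maps a global index to its serial value; parallel_by_proc
    maps a process number to the {index: value} entries it owns.
    """
    worst = None
    for proc, owned in sorted(parallel_by_proc.items()):
        for index, value in sorted(owned.items()):
            error = abs(value - serial[index])
            if error > 0 and (worst is None or error > worst[0]):
                worst = (error, index, proc)
    return worst

# Made-up values: two elements are wrong, one on each process.
serial = {(5, 6): 3.0, (6, 6): 1.0, (6, 7): 2.0}
parallel_by_proc = {
    0: {(5, 6): 3.0, (6, 6): 1.5},
    1: {(6, 7): 2.25},
}
print(most_incorrect(serial, parallel_by_proc))
```

Following only the worst element keeps the number of suspect
references small when many array elements are wrong at once.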

The search then continues for the definition of these
incorrect values, leading to the statement:

    phi2(j,k) = c2*(u(5,ifin,j,k)-
                0.50d+00*
                (u(2,ifin,j,k)**2+
                 u(3,ifin,j,k)**2+
                 u(4,ifin,j,k)**2)/
                u(1,ifin,j,k))

still in routine pintgr. Evaluation establishes that c2 is
correct and that ifin = 11, so the search then focuses on
u(1:5,11,6,6) on process 0.

In subroutine pintgr the array u is in a common block.
The definition of u is found to be in routine ssor at the
statement

    u(m,i,j,k)=u(m,i,j,k)+tmp*rsd(m,i,j,k)

where all used variables are potentially overwritten before
reaching the state available to p2d2, so instrumentation
points are set and a re-execution performed.

On reaching an instrumentation point, we search for the
statements that define rsd(1,11,6,6) on process 0. It
is defined in routine buts at the statement:

    v(1,i,j,k)=v(1,i,j,k)-tv(1,i,j)

The values of tv and v cannot be tested in the current state,
so we search for their definition points. Array tv is defined
in the statement:

    tv(1,i,j)=tv(1,i,j)/tmat(1,1)

where, although tmat cannot be checked, it is defined as

    tmat(m,1)=d(m,1,i,j)

within the same i and j loops. Furthermore, the value of
d(1,1,11,6) on process 0 proves correct so tmat does
not need to be instrumented. Obviously, the used value of
tv has been overwritten, so an instrumentation point for
tv is set and a re-execution performed. The definition of
tv that is then located is the assignment

    tv(m,i,j)=f(v,tv)

in Figure 3, and this is the statement the prototype reports
to the user as the location of the first difference.

The current implementation of the algorithm reports the
use of tv in the statement as the problem. The actual problem,
an incorrect value for v(1,11,7,6) on process 0
due to the missing pipeline communication, will be found
when the prototype has the iteration mapping information
described in Section 2.1. This extension to the prototype is
discussed in Section 5.1.

For simplicity, the above example operation of the al- 
gorithm omits many other successful variable compar- 
isons and instrumentation points that were never reached. 
These were essential to ensure that the problem would not 
be missed, but proved unnecessary as the algorithm pro- 
gressed. 

5. Extending the Prototype 

While the prototype described in Section 3 establishes
the proof of concept of the use of backtracking and
re-execution in the debugging of parallelized programs, it
doesn’t yet meet our vision for a comprehensive automatic
debugger for parallelized programs. In this section we
describe how we plan to extend the functionality of the
prototype to handle a wider variety of problems.

5.1. Iteration Mapping 

Our current prototype assumes that only a trivial iteration
mapping is needed to compare the serial and parallel
computations. Any statement that is executed more than once is
assumed to be instrumentable in both the serial and parallel
programs without having to worry about aligning iterations.

In terms of loop iteration matching, a first-order
approximation of this mapping will describe loop transformations
and unrolling performed at the source code level. More
precise approximations might also include those loop
optimizations performed by the compilers used to generate
the serial and parallel executables. In addition, the
iteration mapping will need to describe program statements that
are executed multiple times because of repeated calls to the
same function.
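
As a hedged illustration of such a first-order mapping, the sketch below relates serial iterations to the iterations of a loop that has been unrolled by a fixed factor at the source level. The function names and the assumption of a zero-based loop with unit stride are ours and are not part of the prototype.

```python
def unrolled_position(i, factor):
    """Map serial iteration i (zero-based, unit stride) to the unrolled loop.

    Returns (trip, copy): the iteration of the unrolled loop and the copy of
    the loop body within that iteration that corresponds to iteration i.
    """
    if i < 0 or factor < 1:
        raise ValueError("iteration must be >= 0 and factor >= 1")
    return divmod(i, factor)


def serial_iteration(trip, copy, factor):
    """Inverse mapping: recover the serial iteration from (trip, copy)."""
    return trip * factor + copy
```

With an unrolling factor of 4, serial iteration 10 corresponds to the third copy of the body (copy index 2) in trip 2 of the unrolled loop.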

Once an iteration mapping is produced there are still fur- 
ther issues to consider. In our prototype, CAPTools will 
communicate to p2d2 the necessary iteration conditions for
each instrumentation point. These conditions can then be
checked each time the instrumentation is triggered. For a 
nested loop this might be unacceptably slow, though, and 
to solve this we could instrument locations progressively 
closer to the actual location. For example, instrumentation 
could be placed in the outer-loop of the loop nest to check 
when its iteration condition is met, and then each successive 
inner-loop would be instrumented in the same manner until 
our desired location and iteration is reached. 
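
The saving from progressive instrumentation can be estimated with a small model. The sketch below is only an illustration under our own assumptions (a two-level loop nest and a cost of one condition check per instrumentation trigger); the function names are hypothetical and it does not drive a real debugger.

```python
def naive_checks(n_outer, n_inner, target):
    """Count checks when the condition is tested at the innermost statement."""
    ti, tj = target
    checks = 0
    for i in range(n_outer):
        for j in range(n_inner):
            checks += 1                  # instrumentation fires every time
            if (i, j) == (ti, tj):
                return checks
    return checks


def progressive_checks(n_outer, n_inner, target):
    """Count checks when the outer loop is instrumented first."""
    ti, tj = target
    checks = 0
    for i in range(n_outer):
        checks += 1                      # outer-loop instrumentation fires
        if i != ti:
            continue                     # inner loop runs uninstrumented
        for j in range(n_inner):
            checks += 1                  # inner-loop instrumentation fires
            if j == tj:
                return checks
    return checks
```

For a 100 by 100 loop nest and a target iteration of (50, 50), this model gives 5051 checks for the naive placement and 102 for the progressive one.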

Additional improvement can be obtained by looking 
into different types of instrumentation. Conditional break- 
points offer an easy way to accommodate iteration condition 
checking, but typically execute much slower than the pro- 
gram itself. Instrumentation that runs at full program exe- 
cution speed (such as offered by Dyninst [4]) is desirable, 
assuming the required types of condition checking can be 
implemented. 

5.2. Different Program Sources 

In the interest of minimizing implementation time, we 
limited our prototype to comparing one-process and multi- 
process executions of the same program. That is, given an 
MPI code produced by CAPTools, we can compare a
single-process execution with a multiprocess one. The user may,
however, be more interested in comparing an execution of 
the original serial program (with no MPI calls in it) to a 
multiprocess execution of the MPI version. 

There are two issues that arise if we are to relax this lim- 
itation of our prototype. First, the parallelization tool will 
need to provide the source-to-source mapping discussed in 
Section 2.1. This mapping may be difficult to produce, be- 
cause, for example, the parallelization tool may have in- 
troduced new functions as a result of either outlining or 
cloning. 

A second issue that comes up concerns the order of exe- 
cution of statements in the two versions of the code. While 
this issue has been studied by Watson and Abramson in 
Guard [16], the addition of backtracking introduces further 
complexity to the situation. Consider the following two sit- 
uations addressed by Guard. 

• If we have multiple instrumentation points in the pro- 
gram, there may be no guarantee that they will be en- 
countered in the same order in the serial and parallel 
executions. The problem then is that we must con- 
tinue past one of the instrumentation points in order 
for the execution to make progress. By continuing we 
may destroy values that are needed for comparison at 
the point when the second execution reaches the cor- 
responding point. Thus, the implementation must take 
care to checkpoint information required for compari- 
son. 



• Temporal displacement concerns the possibility that a
collection of values in one program may not exist all
at the same time during execution of the other program.
This situation can arise, for example, as a result
of scalar expansion or loop fusion. To address this
situation Watson and Abramson suggest an array
constructing technique that collects and checkpoints the
values needed in a comparison.

In Guard, the values being compared are explicit. Thus, if
checkpointed values or constructed arrays are needed, it is
clear which values should be saved. If we add backtracking
to relative debugging, the values that we may want to
examine are not known a priori. Instead, as our comparison
algorithm backtracks we determine the values of interest.
The question we need to address, therefore, is which values
should we checkpoint? One extreme would be to checkpoint
an entire program state. Alternatively, we might want
to anticipate the backtracking that will be done and only
checkpoint values that might be investigated.
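
To make the checkpointing requirement concrete, the following sketch holds the values seen at an instrumentation point until the other execution reaches the corresponding point. It is a minimal illustration and not Guard's or the prototype's implementation; it assumes each point is reached once per run and is identified by a hashable name, and the class name ComparisonBuffer is hypothetical.

```python
import copy


class ComparisonBuffer:
    """Hold checkpointed values until the other run reaches the same point."""

    def __init__(self):
        self._pending = {}       # (point, run) -> checkpointed values
        self.differences = []    # points whose values did not match

    def reached(self, run, point, values):
        """Record that run ("serial" or "parallel") hit point with values."""
        other = "parallel" if run == "serial" else "serial"
        key = (point, other)
        if key in self._pending:
            # The other run was here first; compare against its checkpoint.
            saved = self._pending.pop(key)
            if saved != values:
                self.differences.append(point)
        else:
            # Deep copy so that later writes cannot alter the checkpoint.
            self._pending[(point, run)] = copy.deepcopy(values)
```

Because the values are copied when a point is first reached, that execution can continue past the point and overwrite its variables without invalidating the later comparison.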

5.3. Manually Parallelized Programs 

In order to extend the prototype to manually parallelized
programs, we need to acquire the types of information
described in Section 2 that are currently provided by the
parallelization tool: data dependence and computation mapping
information. There are two possibilities for each class of
information. Either we can try to acquire the information
automatically or we will need to ask the user for it. In our
estimation, automatic methods are clearly preferable, because
the user will likely find the process of providing information
to be tedious and may make mistakes.

Fortunately, it seems straightforward to collect the de- 
pendence analysis information. By running a variant of the 
CAPTools analysis phase on both the serial and parallel
versions of the code, we should be able to answer the dataflow
questions that come up during the relative debugging. 

It is not as straightforward to collect the computation 
mapping. For example, acquiring the source mapping infor- 
mation automatically may be an interesting research prob- 
lem, depending on the similarity of the serial and parallel 
programs. If, for example, functions in the serial code are
present in the parallel code, then there is a natural starting
point of correspondence. It may be possible to use
incomplete mapping information that is automatically derived to
bracket an error, and then rely on the user to provide more
complete mapping information in order to zero in on the
bug. 

Determining the other mapping information automatically
seems more problematic. There has been previous
work done on straightforward ways for a user to describe
the data distribution [7, 9, 15, 17]. Similar approaches may
work for other aspects of data value mapping. When the
data value mapping is combined with dependence analysis
information we may be able to construct execution and
iteration mapping automatically.

5.4. Shared-Memory Parallelism 

One of the restrictions we placed on the prototype imple- 
mentation concerned the parallel programming paradigms 
supported. In particular we limited it to handling only MPI 
programs (see footnote 2). In the future we would like to relax this
restriction by providing support for shared-memory parallel
programs, such as OpenMP programs.

One obstacle to relaxing this restriction is in the imple- 
mentation of p2d2. The debugger uses a client-server archi- 
tecture [6], and the current implementation of the server is 
layered on top of gdb, the debugger from the Free Software
Foundation. Unfortunately, gdb does not support a full set 
of thread control operations. For example, there is no gdb 
command to single-step one thread (and leave others where 
they are). 

In order to compare the computations of an OpenMP 
program and its serial counterpart, we must be able to con- 
trol individual threads in the OpenMP program. To do this 
using the gdb command set, we need to be able to hold some 
threads when we continue others. One way to do this is to 
modify the program counter of each thread to be held so that 
it effectively busy waits while the non-held threads execute 
their normal instruction stream. We have tested this tech- 
nique in a prototype debugger server and it seems promis- 
ing. Some work remains to make it robust enough to use in 
the general case. 

Besides the underlying debugger work needed, our pro- 
totype would need to find sources for the dependence anal- 
ysis and computation mapping information currently pro- 
vided by CAPTools. Fortunately there is a tool, CAPO [8], 
which is based on the CAPTools code base and can generate 
OpenMP programs. It should thus be straightforward to get 
the information we need. 

There are also paradigm issues to consider in automatic 
relative debugging of shared-memory programs. In particu- 
lar, user errors, such as incorrectly indicating that it is safe 
to run a loop in parallel, could lead to race conditions in 
the program that cause nondeterministic behavior. Since it 
will be important for the relative debugging tool to produce 
consistent answers, detecting and handling nondeterminis- 
tic execution in the target code will be critical. 

Footnote 2. While the prototype currently handles only MPI programs, extending
it to other message-passing libraries, such as PVM [13], is straightforward.
There are two issues to address: tool generation of the code and debugging
codes of that type. CAPTools already produces codes containing calls to
CAPLib, a generic message-passing library. With the exception of process
startup, p2d2 is independent of the message-passing library used. Thus,
accommodating a new distributed communication library reduces to
implementing CAPLib in terms of the library and having the library's process
creation mechanism notify p2d2 when there are new processes to debug.



5.5. Identifying and Correcting Bugs 

Detecting the first difference is not necessarily the end
of the story for our relative debugging mechanism. In the
parallelized version of a serial program we can expect to
encounter certain classes of bugs that are symptomatic of
mistakes made in manual parallelizations and incorrect user
inputs in the parallelization process. A mechanism that not
only isolates the location of differences, but also identifies
the type of bug that caused the difference along with a
potential corrective action, could be of tremendous value to
the user.

We would like to expand the scope of our difference
detection mechanism to include an analysis of the difference.
The following are some common bug types in distributed- 
memory message-passing programs we believe can be iden- 
tified using this analysis: 

• a missing communication; 

• a communication that does not convey all required 
data; 

• a communication that overwrites correct data with
incorrect data; and

• missing computation (i.e., where a computation is
performed in serial but where no matching computation
is performed in any parallel process, perhaps due to
errors in distributed loop limits, etc.).

For example, suppose that in one process of the parallel
execution there is a variable that is first assigned a correct
value and then used. In another process of the parallel
execution, though, the same variable is used without being
assigned, triggering the detection of a difference between
the parallel and serial computations. Analysis of the
definition and use of the variable among the parallel processes
might indicate that a communication is missing between
those processes.
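
A rough sketch of this kind of analysis follows. It is an assumption-laden illustration rather than a description of the prototype: it supposes that a trace of definition, receive, and use events per process is available, and the event format and the function name find_missing_communication are hypothetical.

```python
def find_missing_communication(events):
    """Suggest missing communications from per-process def/use events.

    events is a list of (process, kind, variable) tuples in execution order,
    where kind is "def", "recv", or "use". A "recv" counts as a definition.
    Returns (variable, defining_process, using_process) suggestions.
    """
    defined = set()      # (process, variable) pairs assigned or received
    definers = {}        # variable -> processes that computed it locally
    suspects = []        # (process, variable) uses with no prior definition
    for process, kind, variable in events:
        if kind == "def":
            defined.add((process, variable))
            definers.setdefault(variable, []).append(process)
        elif kind == "recv":
            defined.add((process, variable))
        elif kind == "use" and (process, variable) not in defined:
            suspects.append((process, variable))
    # A use without a definition, paired with a definition on another
    # process, hints that a communication between the two is missing.
    return [(variable, source, process)
            for process, variable in suspects
            for source in definers.get(variable, [])
            if source != process]
```

If process 1 defines and uses v while process 0 only uses it, the sketch suggests a missing communication of v from process 1 to process 0; once process 0 receives v before using it, nothing is reported.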

Additionally, difference analysis in a prototype that has
been extended as described in Section 5.4 might help isolate
the following bug types in shared-memory OpenMP
programs:

• invalid parallel execution of a serial loop, producing
nondeterministic behavior due to data races;

• mis-declaring a shared variable to be private, leading
to data being lost when the parallel region is exited;

• mis-declaring a private variable to be shared, leading
to overwriting of values by other threads; and

• missing synchronization, leading to the use of stale 
values. 


The definition, use, and sharing of values could be ana- 
lyzed to identify these bugs, similarly to the way defini- 
tion, use, and communication of values might be analyzed 
in distributed-memory programs. 

6. Related Work 

Backtracking has existed in debuggers for more than 30 
years. Agrawal's thesis [2] surveys several approaches that
roll back execution from an erroneous state, looking for the 
original bug in the program. Typically, these mechanisms 
use a combination of dependence information from a static 
analysis pass and trace information collected during an ex- 
ecution. 

Agrawal also reviews filtering techniques that use static 
analysis information. Program slicing and program dicing 
attempt to obtain the subset of a program that may have had 
an effect on a given variable. A slice reduces the search 
space for bugs, but does not provide, on its own, a
comprehensive way to isolate them. Program dicing improves upon
slicing by narrowing down the search space of a slice using 
information about correct variables. 
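
The relationship between a slice and a dice can be sketched on a small dependence graph. The code below is a simplified, variable-level illustration (slicing is normally defined over program statements), and the function names and graph encoding are our own assumptions.

```python
def backward_slice(deps, var):
    """Return every variable that may affect var, including var itself.

    deps maps a variable to the variables that its definition reads.
    """
    seen = set()
    stack = [var]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, []))
    return seen


def dice(deps, incorrect, correct):
    """Slice on the incorrect variable minus slices of known-correct ones."""
    result = backward_slice(deps, incorrect)
    for var in correct:
        result -= backward_slice(deps, var)
    return result
```

The dice relies on the heuristic that anything contributing to a correct variable is itself correct, which is why it narrows the search space but does not guarantee that the bug lies inside it.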

There has been considerable work done in the last few
years on Relative Debugging [1]. For example, the Guard 
debugger [16] allows the user to indicate comparisons that 
should be made at runtime between two executions. Dur- 
ing execution, it performs the comparisons, even taking care 
of potential execution order changes between the two pro- 
grams, and stops when a difference is detected. As part of 
this work, Watson and Abramson [15, 17] detail an algebra 
for describing data distributions. 

Relative debugging has been applied previously to the 
specific problem of finding differences between serial and 
tool-generated parallel programs [7, 9]. In those efforts, 
the user indicates what variable is wrong and where it is 
wrong in the program. The debugger then queries the par- 
allelization tool to find out which routines modify the vari- 
able and under what name it is modified. With that data, 
the debugger then inserts comparison instrumentation at en- 
try and exit of those routines. When the serial and parallel 
programs are then run side-by-side, the debugger is able to 
narrow down the difference to a single subprogram. 

7. Conclusions 

The automation of relative debugging for parallelized 
programs can provide a significant reduction in the effort 
required of users to find where a serial and parallel compu- 
tation diverge. The steps typically performed by the user 
can be more quickly and comprehensively performed using 
mapping and data dependence information, potentially from 
a parallelization tool, to inform the actions of a parallel de- 
bugger. 


In this work we have described the implementation
of a practical mechanism that uses backtracking and
re-execution to automate a relative debugging session.
Instrumentation points are determined using existing static
analysis information along with dynamically retrieved program
state, without requiring that trace information be collected
during executions. This allows us to avoid both excessive
re-execution and excessive instrumentation by exploiting as
much current information as possible at every stage.

The practicality of using automated relative debugging 
continues to be of interest to us, and our prototype im- 
plementation helps to shed light on where algorithmic im- 
provements and better utilization of currently available in- 
formation can lead to further reductions in required user 
knowledge and time. This issue is of particular concern to 
the high performance community due to the great need for 
porting of large-scale, long-running programs to parallel
forms. 